class: center, middle, inverse, title-slide

.title[
# Class 2a: Review of concepts in Probability and Statistics
]
.author[
### Business Forecasting
]

---

<style type="text/css">
.remark-slide-content {
    font-size: 20px;
}
</style>

### Summary

- In the last class:
  - We discussed the organization of the course
  - We gave an overview of forecasting methods
  - We learned about methods of qualitative forecasting
  - *Reference:* Forecasting Methods and Applications, chapter 1
- This set of classes:
  - We will start learning about exploratory analysis to prepare the forecast
  - We will learn about various **data types**
  - We will learn how to **summarize data graphically**
  - We will learn how to **summarize data with summary statistics**
  - We will learn about **comparisons and associations**
  - *Reference:* Forecasting Methods and Applications, chapter 2.1-2.4

---

### Scenario

- Many online pharmacies have appeared recently that write prescriptions and offer drug subscriptions
  - Example in Mexico: *Choiz*

--

- At the same time, a new wave of very effective anti-diabetes drugs has appeared that also helps people lose weight
  - Example: *Ozempic*

--

- You are consulting for a business that wants to provide subscription services for these drugs in Mexico

--

- Your boss asks you to do exploratory market research for a potential sales forecast

---

### Parameters vs Statistics

- You need to know how many people in Mexico have diabetes
- Call `\(\mu_d\)` the proportion of the Mexican population that has diabetes
- This is a parameter that you want to learn
- But you don't have data on the whole population.
At best you can get a sample from a survey
- So you will try to estimate this parameter with a sample
- You will calculate a statistic `\(\hat{\mu}_d\)`, which is the proportion of diabetics in the sample

---
layout: false
class: inverse, middle

# Types of Data

---

### Longitudinal Data

- Observations are collected for the same subject over a period of time
- Example: Tracking a company's annual revenue and number of employees over several years

#### Longitudinal Data Example
---

### Cross-Sectional Data

- Observations are collected at a single point in time
- Example: A survey of customers' satisfaction with a product and likelihood of repurchase at a certain point in time

#### Cross-Sectional Data Example
---

### Panel Data

- Combines both longitudinal and cross-sectional data
- Observations are collected for multiple subjects over multiple points in time
- Example: Tracking the annual revenue and number of employees of several companies over a few years

#### Panel Data Example
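
A minimal sketch in R of how the three data types relate. The company names, years, and figures below are hypothetical and invented purely for illustration; they are not course data.

```r
# Hypothetical panel: 2 companies observed over 3 years (all values invented)
panel <- data.frame(
  company   = rep(c("A", "B"), each = 3),
  year      = rep(2021:2023, times = 2),
  revenue   = c(10, 12, 15, 40, 38, 45),
  employees = c(50, 55, 63, 200, 190, 210)
)

# Longitudinal: one subject followed over time
longitudinal <- subset(panel, company == "A")

# Cross-sectional: several subjects at a single point in time
cross_section <- subset(panel, year == 2023)

print(panel)
print(longitudinal)
print(cross_section)
```

Filtering the panel to one company gives longitudinal data, and filtering it to one year gives cross-sectional data.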
---

## Q1

--

**Panel data**

- Multiple observations per subject (currency)

---

## Q2

--

**Cross-sectional data**

- Single observation per subject (user)

---

## Q3
--

**Longitudinal data**

- Multiple observations of a single subject

---
layout: false
class: inverse, middle

## Variable Types

---

## Variable Types

We have two general types: .blue[Categorical] and .blue[Numerical] variables

### Categorical Variables

- Variables that can be divided into one or more groups or categories.
  - **Ordinal:** These variables can be logically ordered or ranked.
    - *Variable:* Customer Satisfaction Survey Results
    - *Example:* Very Unsatisfied, Unsatisfied, Neutral, Satisfied, Very Satisfied
  - **Nominal:** These variables cannot be ordered or ranked.
    - *Variable:* Social Media Platforms Used
    - *Example:* Facebook, Instagram, Twitter, LinkedIn, TikTok, Snapchat

---

### Numerical Variables

- Variables that hold numeric data.
  - **Discrete:** These variables can only take certain values
    - *Example*: Number of App Downloads from App Store
    - *Example*: Number of children you have
  - **Continuous:** These variables can take any value within a range
    - *Example*: Time (in seconds) Spent on a Webpage
    - *Example*: Exchange rate between MXN and USD

---

### Mexican Health Survey

Representative sample of the Mexican population .red[n=37858]
--

- *Age*: Numerical, Discrete

--

- *Gender*: Categorical, Nominal

--

- *Weight*: Numerical, Continuous

--

- *Location_type*: Categorical, Nominal

--

- *Diabetes*: Categorical, Nominal

--

- *Mother_diabetes*: Categorical, Nominal

--

- *Difficulty_walking*: Categorical, Ordinal

---
layout: false
class: inverse, middle

# Summarizing Data

## Graphical summaries

---

### Categorical variables

### Frequency Tables

**Frequency table**: presents the absolute frequencies (counts) and relative frequencies (shares) of each category.

- Relative frequency of category `\(i\)`: `\(p_i=\frac{n_i}{N}\)`
  - `\(n_i\)` is the count of category `\(i\)`
  - `\(N\)` is the total count in the sample

.pull-left[
] .pull-right[
]

---

### Bar Charts

**Bar charts** visually represent the frequency count of each category

.center[
<!-- -->
]

---

### Bar Charts

**Bar charts** visually represent the frequency count of each category

.center[
<!-- -->
]

---

### More Creative Bar Chart

<center>
<img src=Bar_chart_food_poisoning.png width="800">
</center>

---

### Pie Charts

**Pie chart**: Each slice is proportional to the category's frequency

.center[
]

---

### Pie Charts

**Pie chart**: (The angle of) each slice is proportional to the category's frequency

.center[
]

---

### My favorite pie chart

<center>
<img src=Netflix_pie_chart.jpg width="800">
</center>

---

### Treemaps

**Treemap**: each group is represented by a rectangle whose area is proportional to its value.

.pull-left[

#### Data
]

.pull-right[

#### Treemap
]

---

### Numerical variables: Discrete

**Dotplot**: presents one dot for each observation and stacks observations with similar values

- Lets you clearly see the distribution and the outliers
- Impractical for larger data sets
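
A dotplot of this kind can be sketched in base R with `stripchart()`. The vector `children` below is a made-up example, not the survey data.

```r
# Hypothetical discrete data: number of children for 15 respondents (invented)
children <- c(0, 1, 1, 2, 2, 2, 3, 0, 1, 2, 4, 2, 1, 0, 9)

# Counts per value: the height of each stack of dots
stack_heights <- table(children)
print(stack_heights)

# One dot per observation; equal values are stacked on top of each other
stripchart(children, method = "stack", pch = 16, offset = 0.5,
           xlab = "Number of children")
```

The single dot at 9 stands out as an outlier, which is exactly what the dotplot makes easy to spot in small samples.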
.center[
<!-- -->
]

---

### Numerical variables: Discrete

.center[
<!-- -->
]

---

### Frequency Distribution

Suppose we ask people aged 30-50 how many partners they have had in their lives.

- What's the distribution of the number of partners?
  - Calculate relative frequencies
  - Show them on a bar graph

.pull-left[

#### Data
]

.pull-right[

#### Distribution

<img src="data:image/png;base64,#C_2_slides_a_files/figure-html/unnamed-chunk-20-1.png" width="100%" />
]

---

### Frequency Distribution

We can also show the frequency distribution of age among people who have diabetes in our data

<img src="data:image/png;base64,#C_2_slides_a_files/figure-html/unnamed-chunk-21-1.png" width="100%" />

---

### Frequency Distribution

Compare it to the age distribution in the adult population (20+)

<img src="data:image/png;base64,#C_2_slides_a_files/figure-html/unnamed-chunk-22-1.png" width="100%" />

---

## Numerical Variables: Continuous

- What about continuous values? Why can't we do the same?

.pull-left[
<!-- -->
]

.pull-right[
]

- Most values never repeat, so they have very low relative frequency

---

## Histograms

**Solution**: Group similar values together

- Construct intervals and show how many observations fall in each interval

--

**Process**

1. Decide how many intervals to use
2. Decide how wide they are
3. Calculate the absolute and relative frequencies of each interval
4. Plot them with bars

--

---

**My approach**

- I want `\(k\)` (for example, `\(k=5\)`) equal intervals

--

- Divide the range of the data into `\(k\)` equal intervals

--

- *Range* is max-min of the data

--

```r
# Calculate the max and min of the weights
max_value <- max(Health_data$weight)
min_value <- min(Health_data$weight)
# Calculate the difference (named weight_range to avoid masking base::range)
weight_range <- max_value - min_value
```

--

```
## [1] "Range= 190.8078 - 30.3745 = 160.4333"
```

--

- With 5 intervals, each will be about 32 kg wide (160.4333 / 5 ≈ 32.09)

--

- The first one starts at the minimum value (30.3745)

--

- The last one ends at the maximum value (190.8078)

--

- Calculate how many observations fall in each interval and the relative frequency of each

---

## Histograms

- The midpoint is the middle of the interval, i.e., the center of the bar
- `\(P_i\)` is the cumulative frequency: the share of observations in this or any lower interval
  - *Example*: `\(P_{(62.46-94.55)}=0.911\)`
  - *Interpretation*: 91.1% of people weigh less than 94.55 kg

.pull-left[
<!-- -->
]

.pull-right[
]

---

## Histogram with 10 Classes

Now, let's increase the number of classes to 10.

.pull-left[
<!-- -->
]

.pull-right[
]

---

## Histogram with 100 Classes

.pull-left[
<!-- -->
]

.pull-right[
]

- Helps to see the distribution and outliers
- Is more always better?
- With narrower intervals (and enough data), the histogram approaches the shape of the **probability density function**

---

## Probability Density Function (PDF)

### Definition

- **Probability Density Function (PDF)** describes the probability distribution of a continuous random variable.
- The PDF represents the relative likelihood of the random variable taking values near a given point.
- Unlike the probability mass function (PMF) for discrete variables, the PDF does not give the probability of exact values; probabilities come from the area under the PDF over an interval.

--

### Example

Suppose we have a random variable X representing the weight of adults in the Mexican population. The PDF of X describes how likely it is to find a person whose weight falls within a given range (e.g., between 58kg and 60kg).

---

### How They Work

To calculate the probability of X falling within a specific range [a, b], you need to integrate the PDF from a to b:

`\(P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx\)`

The total area under the PDF curve is equal to 1
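
As a sketch of this integral in R, assume (purely for illustration) that weight follows a normal distribution with mean 70 kg and standard deviation 15 kg. These parameters are hypothetical and are not estimated from the survey.

```r
# Assumed (hypothetical) weight distribution: Normal(mean = 70, sd = 15)
f <- function(x) dnorm(x, mean = 70, sd = 15)

# P(58 <= X <= 60): integrate the PDF from a = 58 to b = 60
p_interval <- integrate(f, lower = 58, upper = 60)$value

# The same probability from the normal CDF, as a cross-check
p_check <- pnorm(60, mean = 70, sd = 15) - pnorm(58, mean = 70, sd = 15)

# Area under the PDF over 70 +/- 10 sd, which covers essentially all the mass
total_area <- integrate(f, lower = -80, upper = 220)$value

print(c(p_interval = p_interval, p_check = p_check, total_area = total_area))
```

The numerical integral and the CDF difference should agree, and the area over the wide range should be 1 up to numerical error.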
---

## Distribution Shapes: Modality

<img src="data:image/png;base64,#C_2_slides_a_files/figure-html/unnamed-chunk-37-1.png" width="100%" />

---

## Which is uniformly distributed?

1. weights of adult females
2. salaries of a random sample of people from CDMX
3. House prices in CDMX
4. birthdays of classmates (day of the month)

---

## Distribution Shapes: Modality

<img src="data:image/png;base64,#C_2_slides_a_files/figure-html/unnamed-chunk-38-1.png" width="100%" />

---

### Age at death

<center>
<img src=Age_at_death.jpeg width="800">
</center>

---

## We want to know how many people weigh more than 100kg

---

## Cumulative Distribution Function (CDF)

The .blue[Cumulative Distribution Function] (CDF) gives the probability that a continuous random variable X will take on a value less than or equal to a specific value x.

For a continuous random variable X with PDF f(x), the CDF F(x) is defined as:

`\(F(x) = \int_{-\infty}^{x} f(t) \, dt = P(X \leq x)\)`

The CDF provides a cumulative view of the probability distribution and is useful for calculating probabilities over intervals and finding percentiles.

- The CDF starts at 0, since the probability of X being less than or equal to negative infinity is 0
- It approaches 1 as x approaches infinity, since the probability of X being less than or equal to positive infinity is 1.

---

## Example 1: Standard Normal

`\(F(-2) = \int_{-\infty}^{-2} f(t) \, dt = P(X \leq -2)=0.02\)`

<!-- -->

---

## Example 2: Standard Normal

`\(F(0.2) = \int_{-\infty}^{0.2} f(t) \, dt = P(X \leq 0.2)=0.58\)`

<!-- -->

---

## Example 3: Standard Normal

`\(F(3.2) = \int_{-\infty}^{3.2} f(t) \, dt = P(X \leq 3.2)=0.999\)`

<!-- -->

---

### Empirical CDF

We can do a similar thing with our weight data.

`\(ECDF(x)=\frac{\sum I(w_i<x)}{N}=\frac{\text{Number of people with weight lower than x}}{N}\)`

- `\(I(w_i<x)=1\)` if the weight of person `\(i\)` is lower than `\(x\)`, and 0 otherwise (*Indicator Function*)
- `\(N\)` is the total number of people (*Sample Size*)
- So `\(ECDF(x)\)` is the share of people with weight lower than x
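
A minimal sketch of this formula in R. The weights are simulated stand-ins (hypothetical, not the survey data), and `ecdf_strict` is a helper name introduced here for illustration. R's built-in `ecdf()` counts values less than or equal to x, so the strict version from the slide is written out by hand.

```r
set.seed(1)
# Simulated stand-in for the weight variable (hypothetical values, in kg)
w <- rnorm(1000, mean = 72, sd = 16)

# ECDF(x) = sum(I(w_i < x)) / N, using the strict inequality from the slide
ecdf_strict <- function(x, data) {
  indicator <- data < x          # TRUE (1) if w_i < x, FALSE (0) otherwise
  sum(indicator) / length(data)
}

share_below_80 <- ecdf_strict(80, w)
print(share_below_80)

# Built-in version uses <=; it matches here because no weight equals exactly 80
print(ecdf(w)(80))
```

With continuous measurements, ties at the exact cutoff are rare, so the strict and non-strict versions usually give the same share.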
---

### So how do we calculate the share of people with weight>100kg?

--

`\(P(weight>100)=1-P(weight<100)=1-ECDF(100)\)`
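
The complement rule can be checked numerically. As a sketch, the weights below are simulated stand-ins (hypothetical values), not the survey data.

```r
set.seed(1)
# Simulated stand-in for the weight variable (hypothetical values, in kg)
w <- rnorm(1000, mean = 72, sd = 16)

# ECDF(100): share of people with weight below 100 kg
ecdf_100 <- mean(w < 100)

# Share of people heavier than 100 kg via the complement
share_above_100 <- 1 - ecdf_100

# Direct count for comparison (no simulated weight is exactly 100)
share_direct <- mean(w > 100)

print(c(complement = share_above_100, direct = share_direct))
```

Both numbers agree because no observation sits exactly at 100 kg; with ties at the cutoff, the complement would also include those observations.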